reward estimator
- North America > United States > California (0.14)
- North America > United States > Iowa (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Europe > Middle East > Malta (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Singapore (0.04)
- Asia > India > Maharashtra > Mumbai (0.04)
- Information Technology > Data Science > Data Mining > Big Data (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
Provably Efficient Interactive-Grounded Learning with Personalized Reward
Zhang, Mengxiao, Zhang, Yuheng, Luo, Haipeng, Mineiro, Paul
Interactive-Grounded Learning (IGL) [Xie et al., 2021] is a powerful framework in which a learner aims at maximizing unobservable rewards through interacting with an environment and observing reward-dependent feedback on the taken actions. To deal with personalized rewards that are ubiquitous in applications such as recommendation systems, Maghakian et al. [2022] study a version of IGL with context-dependent feedback, but their algorithm does not come with theoretical guarantees. In this work, we consider the same problem and provide the first provably efficient algorithms with sublinear regret under realizability. Our analysis reveals that the step-function estimator of prior work can deviate uncontrollably due to finite-sample effects. Our solution is a novel Lipschitz reward estimator which underestimates the true reward and enjoys favorable generalization performance. Building on this estimator, we propose two algorithms, one based on explore-then-exploit and the other based on inverse-gap weighting. We apply IGL to learning from image feedback and learning from text feedback, which are reward-free settings that arise in practice. Experimental results showcase the importance of using our Lipschitz reward estimator and the overall effectiveness of our algorithms.
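As context for the inverse-gap-weighting variant mentioned in the abstract, the sketch below shows the standard inverse-gap-weighting action distribution from the contextual-bandit literature; the predictor output `f_hat` and the exploration parameter `gamma` are illustrative placeholders, not taken from the paper.

```python
import numpy as np

def igw_distribution(f_hat, gamma):
    """Standard inverse-gap-weighting distribution over K actions.

    f_hat : array of predicted rewards, one per action (illustrative predictor output).
    gamma : exploration parameter; larger gamma concentrates mass on the greedy action.
    """
    K = len(f_hat)
    best = int(np.argmax(f_hat))
    p = np.zeros(K)
    for a in range(K):
        if a != best:
            # Probability inversely proportional to the estimated reward gap.
            p[a] = 1.0 / (K + gamma * (f_hat[best] - f_hat[a]))
    p[best] = 1.0 - p.sum()  # remaining mass goes to the empirically best action
    return p

# Example: sample an action from the induced distribution.
rng = np.random.default_rng(0)
probs = igw_distribution(np.array([0.2, 0.5, 0.1]), gamma=10.0)
action = rng.choice(len(probs), p=probs)
```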
- North America > United States > California (0.14)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Europe > Middle East > Malta > Port Region > Southern Harbour District > Valletta (0.04)
Multi-Fidelity Reinforcement Learning for Time-Optimal Quadrotor Re-planning
Ryou, Gilhyun, Wang, Geoffrey, Karaman, Sertac
High-speed online trajectory planning for UAVs poses a significant challenge due to the need for precise modeling of complex dynamics while also being constrained by computational limitations. This paper presents a multi-fidelity reinforcement learning method (MFRL) that aims to effectively create a realistic dynamics model and simultaneously train a planning policy that can be readily deployed in real-time applications. The proposed method involves the co-training of a planning policy and a reward estimator; the latter predicts the performance of the policy's output and is trained efficiently through multi-fidelity Bayesian optimization. This optimization approach models the correlation between different fidelity levels, thereby constructing a high-fidelity model based on a low-fidelity foundation, which enables the accurate development of the reward model with limited high-fidelity experiments. The framework is further extended to include real-world flight experiments in reinforcement learning training, allowing the reward model to precisely reflect real-world constraints and broadening the policy's applicability to real-world scenarios. We present rigorous evaluations by training and testing the planning policy in both simulated and real-world environments. The resulting trained policy not only generates faster and more reliable trajectories compared to the baseline snap minimization method, but it also achieves trajectory updates in 2 ms on average, while the baseline method takes several minutes.
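As a rough, generic illustration of the multi-fidelity idea (not the authors' implementation), one common construction models the high-fidelity reward as a cheap low-fidelity surrogate plus a residual learned from a handful of expensive evaluations; the scikit-learn sketch below uses synthetic, hypothetical data.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical data: many cheap low-fidelity rollouts, few expensive high-fidelity ones.
X_lo, y_lo = np.random.rand(200, 3), np.random.rand(200)
X_hi, y_hi = np.random.rand(10, 3), np.random.rand(10)

# Fit the low-fidelity reward surrogate on abundant cheap data.
gp_lo = GaussianProcessRegressor().fit(X_lo, y_lo)

# Fit a residual model on the few high-fidelity points:
# high-fidelity reward ~ low-fidelity prediction + learned correction.
residual = y_hi - gp_lo.predict(X_hi)
gp_res = GaussianProcessRegressor().fit(X_hi, residual)

def predict_high_fidelity(X):
    """Correct the cheap surrogate with the residual learned from expensive data."""
    return gp_lo.predict(X) + gp_res.predict(X)
```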
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
- North America > United States > Virginia (0.04)
- (2 more...)
Why Guided Dialog Policy Learning performs well? Understanding the role of adversarial learning and its alternative
Shimoyama, Sho, Morimura, Tetsuro, Abe, Kenshi, Takamichi, Toda, Tomomatsu, Yuta, Sugiyama, Masakazu, Hentona, Asahi, Azuma, Yuuki, Ninomiya, Hirotaka
Dialog policies, which determine a system's action based on the current state at each dialog turn, are crucial to the success of the dialog. In recent years, reinforcement learning (RL) has emerged as a promising option for dialog policy learning (DPL). In RL-based DPL, dialog policies are updated according to rewards. The manual construction of fine-grained rewards, such as state-action-based ones, to effectively guide the dialog policy is challenging in multi-domain task-oriented dialog scenarios with numerous state-action pair combinations. One way to estimate rewards from collected data is to train the reward estimator and dialog policy simultaneously using adversarial learning (AL). Although this method has demonstrated superior performance experimentally, it is fraught with the inherent problems of AL, such as mode collapse. This paper first identifies the role of AL in DPL through detailed analyses of the objective functions of dialog policy and reward estimator. Next, based on these analyses, we propose a method that eliminates AL from reward estimation and DPL while retaining its advantages. We evaluate our method using MultiWOZ, a multi-domain task-oriented dialog corpus.
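For readers unfamiliar with the adversarial setup being analyzed, a GAIL-style arrangement trains a discriminator to distinguish corpus (state, action) pairs from policy-generated ones and uses its confidence as the estimated reward; the PyTorch sketch below is a generic illustration with hypothetical dimensions, not the authors' code.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 64, 16  # hypothetical sizes

# Discriminator scores concatenated (state, action) vectors: corpus data vs. policy data.
disc = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def discriminator_step(expert_sa, policy_sa):
    """One adversarial update: label corpus pairs 1, policy-generated pairs 0."""
    logits_e = disc(expert_sa)
    logits_p = disc(policy_sa)
    loss = bce(logits_e, torch.ones_like(logits_e)) + bce(logits_p, torch.zeros_like(logits_p))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def estimated_reward(sa):
    """Use the discriminator's confidence as a learned reward for the dialog policy."""
    with torch.no_grad():
        return torch.log(torch.sigmoid(disc(sa)) + 1e-8)
```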
A Domain-Independent Agent Architecture for Adaptive Operation in Evolving Open Worlds
Mohan, Shiwali, Piotrowski, Wiktor, Stern, Roni, Grover, Sachin, Kim, Sookyung, Le, Jacob, De Kleer, Johan
Model-based reasoning agents are ill-equipped to act in novel situations in which their model of the environment no longer sufficiently represents the world. We propose HYDRA - a framework for designing model-based agents operating in mixed discrete-continuous worlds that can autonomously detect when the environment has evolved from its canonical setup, understand how it has evolved, and adapt the agent's models to perform effectively. HYDRA is based upon PDDL+, a rich modeling language for planning in mixed, discrete-continuous environments. It augments the planning module with visual reasoning, task selection, and action execution modules for closed-loop interaction with complex environments. HYDRA implements a novel meta-reasoning process that enables the agent to monitor its own behavior across a variety of aspects. The process employs a diverse set of computational methods to maintain expectations about the agent's own behavior in an environment. Divergences from those expectations are useful in detecting when the environment has evolved and identifying opportunities to adapt the underlying models. HYDRA builds upon ideas from diagnosis and repair and uses a heuristics-guided search over model changes such that the agent becomes competent in novel conditions. The HYDRA framework has been used to implement novelty-aware agents for three diverse domains - CartPole++ (a higher-dimensional variant of a classic control problem), Science Birds (an IJCAI competition problem), and PogoStick (a specific problem domain in Minecraft). We report empirical observations from these domains to demonstrate the efficacy of various components in the novelty meta-reasoning process.
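One of the expectation checks described above can be illustrated with a minimal state-prediction divergence test; the function below is a hypothetical sketch (names and threshold are illustrative), not HYDRA's actual monitoring code.

```python
import numpy as np

def novelty_detected(predicted_states, observed_states, threshold=0.5):
    """Flag novelty when observed behavior diverges from the model's expectations.

    predicted_states / observed_states : arrays of shape (T, state_dim)
    threshold : illustrative tolerance on the mean per-step prediction error.
    """
    errors = np.linalg.norm(np.asarray(predicted_states) - np.asarray(observed_states), axis=1)
    return errors.mean() > threshold
```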
- Africa > South Sudan > Equatoria > Central Equatoria > Juba (0.04)
- North America > United States > Oklahoma > Payne County > Cushing (0.04)
- Europe > Slovenia > Central Slovenia > Municipality of Komenda > Komenda (0.04)
- (2 more...)
- Leisure & Entertainment > Games > Computer Games (0.66)
- Media > Television (0.46)
- Government > Regional Government > North America Government > United States Government (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- (2 more...)
Fast Offline Policy Optimization for Large Scale Recommendation
Sakhi, Otmane, Rohde, David, Gilotte, Alexandre
Personalised interactive systems such as recommender systems require selecting relevant items from massive catalogues, dependent on context. Reward-driven offline optimisation of these systems can be achieved by a relaxation of the discrete problem resulting in policy learning or REINFORCE-style learning algorithms. Unfortunately, this relaxation step requires computing a sum over the entire catalogue, making the complexity of the evaluation of the gradient (and hence each stochastic gradient descent iteration) linear in the catalogue size. This calculation is untenable in many real-world examples such as large catalogue recommender systems, severely limiting the usefulness of this method in practice. In this paper, we derive an approximation of these policy learning algorithms that scales logarithmically with the catalogue size. Our contribution is based upon combining three novel ideas: a new Monte Carlo estimate of the gradient of a policy, the self-normalised importance sampling estimator, and the use of fast maximum inner product search at training time. Extensive experiments show that our algorithm is an order of magnitude faster than naive approaches yet produces equally good policies.
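To make the self-normalised importance sampling idea concrete, the sketch below builds a surrogate objective whose gradient approximates the log-softmax policy gradient using only a subset of sampled catalogue items; it is a generic illustration (the maximum inner product search step that would choose those samples is omitted), not the paper's exact estimator.

```python
import torch

def snis_softmax_surrogate(scores_pos, scores_neg, logq_neg):
    """Surrogate objective whose gradient approximates grad log softmax(pos | full catalogue).

    scores_pos : score of the chosen item, shape (batch,)
    scores_neg : scores of sampled catalogue items, shape (batch, n_samples)
    logq_neg   : log proposal probabilities of those samples, shape (batch, n_samples)
    """
    # Self-normalised importance weights approximating the softmax over the whole catalogue.
    logw = scores_neg - logq_neg
    w = torch.softmax(logw, dim=-1).detach()  # stop-gradient on the weights
    # Gradient of this surrogate = grad(score_pos) - sum_i w_i * grad(score_neg_i),
    # which mirrors the exact log-softmax gradient without summing over every item.
    return (scores_pos - (w * scores_neg).sum(dim=-1)).mean()
```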
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Virginia > Arlington County > Arlington (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- (3 more...)
Finite-Time Analysis of Fully Decentralized Single-Timescale Actor-Critic
Decentralized Actor-Critic (AC) algorithms have been widely utilized for multi-agent reinforcement learning (MARL) and have achieved remarkable success. Apart from this empirical success, the theoretical convergence properties of decentralized AC algorithms remain largely unexplored. Most of the existing finite-time convergence results are derived based on either a double-loop update or a two-timescale step-size rule, and this is the case even for the centralized AC algorithm in the single-agent setting. In practice, the \emph{single-timescale} update is widely used, where the actor and critic are updated in an alternating manner with step sizes of the same order. In this work, we study a decentralized \emph{single-timescale} AC algorithm. Theoretically, using linear function approximation for value and reward estimation, we show that the algorithm has sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-2})$ under Markovian sampling, which matches the optimal complexity with a double-loop implementation (here, $\tilde{\mathcal{O}}$ hides a logarithmic term). When we reduce to the single-agent setting, our result yields a new sample complexity bound for centralized AC using a single-timescale update scheme. Central to establishing our complexity results is \emph{the hidden smoothness of the optimal critic variable} that we reveal. We also provide a local action privacy-preserving version of our algorithm and its analysis. Finally, we conduct experiments to show the superiority of our algorithm over the existing decentralized AC algorithms.
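For intuition, a single-agent, single-timescale step with a linear critic alternates a TD(0) critic update and a policy-gradient actor update with step sizes of the same order; the sketch below is illustrative only and omits the decentralized communication and privacy-preserving components discussed in the abstract.

```python
import numpy as np

def single_timescale_ac_step(w, theta, phi_s, phi_s_next, grad_logpi, r,
                             gamma=0.99, alpha=0.01, beta=0.01):
    """One alternating actor-critic update with same-order step sizes (single-agent sketch).

    w          : linear critic weights, V(s) = w @ phi(s)
    theta      : policy parameters
    phi_s, phi_s_next : feature vectors of the current / next state
    grad_logpi : gradient of log pi(a|s) w.r.t. theta for the taken action
    r          : observed reward
    """
    td_error = r + gamma * w @ phi_s_next - w @ phi_s
    w_new = w + beta * td_error * phi_s                 # critic: semi-gradient TD(0)
    theta_new = theta + alpha * td_error * grad_logpi   # actor: policy gradient with TD error
    return w_new, theta_new
```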